Conversation
* main: (22 commits) feat: add gemini-acp provider, update docs on subscription models + improvements to codex (#8000) fix(openai): use Responses API for gpt-5.4 (#7982) Remove lead/worker provider (#7989) chore(release): release version 1.28.0 (#7991) Fix empty tool results from resource content (e.g. auto visualiser) (#7866) Separate SSE streaming from POST work submission (#7834) fix: include token usage in Databricks streaming responses (#7959) Optimize tool summarization (#7938) fix: overwrite the deprecated googledrive extension config (#7974) refactor: remove unnecessary Arc<Mutex> from tool execution pipeline (#7979) Revert message flush & test (#7966) docs: add Remote Access section with Telegram Gateway documentation (#7955) fix: update webmcp blog post metadata image URL (#7967) fix: clean up OAuth token cache on provider deletion (#7908) fix: hard-coded tool call id in code mode callback (#7939) Fix SSE parsers to accept optional space after data: prefix (#7929) docs: add GOOSE_INPUT_LIMIT to config-files.md (#7961) Add WebMCP for Beginners blog post (#7957) Fix download manager (#7933) Improve the formatting of tool calls, show thinking, treat Reasoning and Thinking as the same thing (sorry Kant) (#7626) ...
* main: (37 commits) fix: handle reasoning content blocks in OpenAI-compat streaming parser (#8078) chore(acp): build native packages on latest mac (#8075) Display delegate sub agents logs in UI (#7519) Update tar version to avoid CVE-2026-33056 (#8073) refactor: consolidate duplicated dependencies into workspace (#8041) tui: set up for publishing via github actions (#8020) feat: feature-gate local inference dependencies (#7976) feat: ability to manage sub recipes in desktop ui (#6360) Tweak the release process: no more merge to main (#7994) fix: gemini models via databricks (#8042) feat(apps): Pass toolInfo to MCP Apps via hostContext (#7506) fix: remove configured marker when deleting oauth provider configuration (#7887) docs: add vmware-aiops MCP extension documentation (#8055) Show setup instructions for ACP providers in settings modal (#8065) deps: replace sigstore-verification with sigstore-verify to kill vulns (#8064) feat(acp): add session/set_config and stabilize list, delete and close (#7984) docs: Correct `gosoe` typo to `goose` (#8062) fix: use default provider and model when provider in session no longer exists (#8035) feat: add GOOSE_SHELL env var to configure preferred shell (#7909) fix(desktop): fullscreen header bar + always-visible close controls (#8033) ...
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1fc2bf84b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| STEP=$((i + 1)) | ||
| echo "[$STEP/$TOTAL] agent-browser ${GLOBAL_ARGS[*]} ${ARGS[*]} --timeout $CMD_TIMEOUT_MS" | ||
| if ! timeout "$CMD_TIMEOUT" pnpm exec agent-browser "${GLOBAL_ARGS[@]}" "${ARGS[@]}"; then |
There was a problem hiding this comment.
Respect command-specific timeouts in replay
replay.sh hard-caps every agent-browser command to 10 seconds via timeout "$CMD_TIMEOUT" ..., so recorded steps that intentionally request longer waits (for example wait --timeout 60000/120000 in the new extension recordings) will still be killed at 10s. This makes long-running LLM/tool steps fail even when the recording specifies a larger timeout, causing deterministic false failures in the new E2E workflow.
Useful? React with 👍 / 👎.
| // In e2e test mode, auto-confirm dialogs so CDP-based tests can proceed | ||
| // without needing to interact with native OS dialogs. | ||
| if (process.env.ENABLE_PLAYWRIGHT) { | ||
| return { response: 1 }; |
There was a problem hiding this comment.
Return the actual affirmative dialog response in E2E mode
In ENABLE_PLAYWRIGHT mode this handler always returns { response: 1 }, but several callers treat index 0 as the confirm action (e.g., KeyboardShortcutsSection conflict/reset dialogs). That means these dialogs are auto-cancelled instead of auto-confirmed during E2E runs, so test behavior diverges from user behavior and some flows become untestable.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3bef76bc45
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| PORT=$((9300 + RANDOM % 100)) | ||
| if ! lsof -i :"$PORT" &>/dev/null; then | ||
| echo "$PORT" |
There was a problem hiding this comment.
Reserve CDP ports atomically across parallel workers
pick_port does a check-then-use (lsof then echo) on a shared 9300–9399 range, but e2e-run-all.sh launches multiple tests concurrently (xargs -P "$WORKERS"). Two workers can select the same still-unbound port, which makes one session connect to the wrong Electron instance or fail nondeterministically once one process binds first. This creates cross-test contamination and flaky CI outcomes in the new parallel runner.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
addressed this
| if ! CDP_PORT=$(wait_for_app "$TEST_NAME"); then | ||
| local DURATION=$(( SECONDS - START_TIME )) | ||
| echo "[$TEST_NAME] FAIL — app did not start within 30s (${DURATION}s)" | ||
| echo "FAIL ${DURATION}s" > "$STATUS_DIR/$TEST_NAME" | ||
| return |
There was a problem hiding this comment.
Always stop the test session on startup timeout
When app startup does not become ready within 30s, run_one returns early before reaching the teardown call (e2e-stop.sh). Since the app was launched in a detached screen session, this path can leave orphaned Electron/goosed processes and stale session artifacts, which then interfere with subsequent recordings in the same run (e.g., port conflicts and residual state).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2dcb343a45
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if [[ -n "$CDP_PORT" ]]; then | ||
| lsof -ti :"$CDP_PORT" 2>/dev/null | xargs kill -9 2>/dev/null || true | ||
| pkill -9 -f "remote-debugging-port=$CDP_PORT" 2>/dev/null || true |
There was a problem hiding this comment.
Stop goosed cleanly instead of SIGKILLing Electron
This teardown path force-kills the CDP listener with kill -9, which bypasses Electron shutdown hooks that normally call goosedResult.cleanup() in main.ts. Because goosed is spawned as goosed agent (no session path in argv), the later pattern kill in e2e-stop.sh does not reliably target it, so tests can leave orphan backend processes running and contaminate subsequent runs.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
kill -9 is intentional to avoid hanging during cleanup. Orphan goosed processes are swept by pkill -9 -f "$BASE_DIR" at the end of e2e-stop.sh. We haven't seen orphan contamination in practice. Will revisit if it becomes an issue.
| trap 'echo ""; echo "Interrupted, stopping active E2E sessions..."; cleanup_and_exit 130' INT TERM | ||
|
|
||
| # Run recordings in parallel with worker limit | ||
| printf '%s\n' "${RECORDINGS[@]}" | xargs -P "$WORKERS" -I {} bash -c "run_one '{}' '$STATUS_DIR'" || true |
There was a problem hiding this comment.
Gate parallel workers when single-instance lock applies
The runner always executes recordings in parallel, but the app still enforces app.requestSingleInstanceLock() on non-macOS in main.ts. On Linux/Windows, only one Electron instance can remain alive, so additional workers will fail startup and be reported as test failures; this needs a platform guard (or forcing --workers 1) to avoid deterministic false negatives outside macOS.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
we only supports macos at the moment
| timeout "$TIMEOUT" bash "$SCRIPT_DIR/replay.sh" \ | ||
| "$RECORDING" \ |
There was a problem hiding this comment.
Respect scenario time budgets before test-level timeout
Each recording is wrapped in a fixed timeout "$TIMEOUT" (default 120s), but new recordings include waits up to 120s after earlier long waits (for example, extension-install-playwright.batch.json has 60s + 60s + 120s waits). This causes the harness to terminate otherwise valid long-running scenarios before their own declared timeouts elapse, producing avoidable TIMEOUT failures.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
reduced timeout in the test
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8281334588
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const goosedResult = await startGoosed({ | ||
| serverSecret, | ||
| dir: dir || os.homedir(), | ||
| dir: process.env.GOOSE_WORKING_DIR || dir || os.homedir(), |
There was a problem hiding this comment.
Preserve requested chat directory over env default
createChat now always prefers process.env.GOOSE_WORKING_DIR over the per-window dir argument, so any flow that passes a specific directory (e.g., deep links/new-window creation with dir) is silently ignored whenever that env var is set. This regresses directory-scoped behavior in those launches: users/tests can request a different folder, but startGoosed still starts in the env folder. Restricting the env override to initial defaulting (or to explicit E2E mode) avoids breaking dir-driven window creation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0430864b99
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| pkill -9 -f "$BASE_DIR" 2>/dev/null || true | ||
| pkill -9 -f 'agent-browser-chrome' 2>/dev/null || true | ||
| pkill -9 -f 'agent-browser-darwin\|agent-browser-linux' 2>/dev/null || true |
There was a problem hiding this comment.
Use ERE alternation when killing agent-browser helpers
pkill patterns are interpreted as regular expressions (BSD/FreeBSD pgrep/pkill docs describe these as extended regex), so alternation should be |, not \|. With the current pattern, this command matches a literal pipe and does not match agent-browser-darwin or agent-browser-linux, leaving helper processes alive after teardown; those leftovers can interfere with later E2E runs on the same machine.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0c9c54d77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| rm -f "$BASE_DIR/.port-locks/$CDP_PORT" | ||
| fi | ||
|
|
||
| pkill -9 -f "$SESSION_DIR" 2>/dev/null || true |
There was a problem hiding this comment.
Kill goosed explicitly for per-session teardown
This teardown path does not reliably terminate the backend process after force-killing Electron. stop_session only pattern-kills by "$SESSION_DIR", but startGoosed launches goosed as goosed agent with the session path only in cwd (not argv), so that match can miss; and in the normal runner path we call e2e-stop.sh "$TEST_NAME", which exits before the global fallback sweeps at the bottom of the script. Fresh evidence: ui/desktop/src/goosed.ts uses spawnArgs = ['agent'] and cwd: workingDir, so argv-based pkill -f "$SESSION_DIR" is not a reliable cleanup mechanism. When this miss happens, orphan goosed processes can leak into later tests and cause cross-test contamination/flakes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 278f91582c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| uses: Swatinem/rust-cache@42dc69e1aa15d09112580998cf2ef0119e2e91ae # v2 | ||
|
|
||
| - name: Install GNU timeout (if missing) | ||
| run: command -v timeout || brew install coreutils |
There was a problem hiding this comment.
Make timeout available on macOS before running E2E
On macos-latest, this step only runs brew install coreutils when timeout is missing, but the replay harness later calls timeout directly for every command. Homebrew coreutils installs gtimeout unless the gnubin path is added, so in the common case where timeout is absent the E2E run can still fail with timeout: command not found before tests execute. Please either export the coreutils gnubin directory (or symlink) here, or update the scripts to fall back to gtimeout.
Useful? React with 👍 / 👎.
| gh run download {run_id} --repo block/goose --name e2e-test-results --dir /tmp/e2e-results 2>&1 | ||
| # example: result_url: https://github.com/block/goose/actions/runs/23487840554/artifacts/6079302235 | ||
| # gh run download 23487840554 --repo block/goose --name e2e-test-results --dir /tmp/e2e-results 2>&1 | ||
| rm -rf ui/desktop/tests/e2e-tests/results-ci | ||
| unzip -o /tmp/e2e-results.zip -d ui/desktop/tests/e2e-tests/results-ci |
There was a problem hiding this comment.
Drop nonexistent unzip step after gh artifact download
When results_url is provided, these instructions call gh run download ... --dir /tmp/e2e-results and then immediately unzip /tmp/e2e-results.zip. gh run download already extracts artifacts into the target directory, so this zip file is not produced; the extra unzip command will fail and block Step 1 for CI-artifact debugging runs.
Useful? React with 👍 / 👎.
Summary
Testing
Related Issues
Relates to #ISSUE_ID
Discussion: LINK (if any)
Screenshots/Demos (for UX changes)
Before:
After: